Abstract

Sigmoid type belief networks, a class of probabilistic neural networks, provide a natural framework for
compactly representing probabilistic information in a variety of unsupervised and supervised learning
problems. Often the parameters used in these networks need to be learned from examples. Unfortunately,
estimating the parameters via exact probabilistic calculations (i.e., the EM algorithm) is intractable even
for networks with fairly small numbers of hidden units. We propose to avoid the infeasibility of the E step
by bounding likelihoods instead of computing them exactly. We introduce extended and complementary
representations for these networks and show that the estimation of the network parameters can be made
fast (reduced to quadratic optimization) by performing the estimation in either of the alternative domains.
The complementary networks can be used for continuous density estimation as well.
1  Introduction

The appeal of probabilistic networks for knowledge representation,
inference, and learning (Pearl, 1988) derives both from the sound
Bayesian framework and from the explicit representation of
dependencies among the network variables, which allows ready
incorporation of prior information into the design of the network.
The Bayesian formalism permits full propagation of probabilistic
information across the network regardless of which variables in the
network are instantiated. In this sense these networks can be
“inverted” probabilistically.

This inversion, however, relies heavily on the use of look-up table
representations of conditional probabilities or representations
equivalent to them for modeling dependencies between the variables.
For sparse dependency structures such as trees or chains this poses
no difficulty. In more realistic cases of reasonably interdependent
variables the exact algorithms developed for these belief networks
(Lauritzen & Spiegelhalter, 1988) become infeasible due to the
exponential growth in the size of the conditional probability tables
needed to store the exact dependencies. Therefore the use of compact
representations to model probabilistic interactions is unavoidable
in large problems. As belief network models move away from tables,
however, the representations can be harder to assess from expert
knowledge and the important role of learning is further emphasized.

Compact representations of interactions between simple units have
long been emphasized in neural networks. Lacking a thorough
probabilistic interpretation, however, classical feed-forward neural
networks cannot be inverted in the above sense; e.g., given the
output pattern of a feed-forward neural network it is not feasible
to compute a probability distribution over the possible input
patterns that would have resulted in the observed output. On the
other hand, stochastic neural networks such as Boltzmann machines
admit probabilistic interpretations and therefore, at least in
principle, can be inverted and used as a basis for inference and
learning in the presence of uncertainty.

Sigmoid belief networks (Neal, 1992) form a subclass of
probabilistic neural networks where the activation function has a
sigmoidal form - usually the logistic function. Neal (1992) proposed
a learning algorithm for these networks which can be viewed as an
improvement of the algorithm for Boltzmann machines. Recently Hinton
et al. (1995) introduced the wake-sleep algorithm for layered
bi-directional probabilistic networks. This algorithm relies on
forward sampling and has an appealing coding theoretic motivation.
The Helmholtz machine (Dayan et al., 1995), on the other hand, can
be seen as an alternative technique for these architectures that
avoids Gibbs sampling altogether. Dayan et al. also introduced the
important idea of bounding likelihoods instead of computing them
exactly. Saul et al. (1995) subsequently derived rigorous mean field
bounds for the likelihoods. In this paper we introduce the idea of
alternative - extended and complementary - representations of these
networks by reinterpreting the nonlinearities in the activation
function. We show that deriving likelihood bounds in the new
representational domains leads to efficient (quadratic) estimation
procedures for the network parameters.


2  The  probability  representations 

Belief networks represent the joint probability of a set of
variables $\{S\}$ as a product of conditional probabilities given by

$$P(S_1, \ldots, S_n) = \prod_{k=1}^{n} P(S_k \mid \mathrm{pa}[k]) \tag{1}$$

where the notation $\mathrm{pa}[k]$, “parents of $S_k$”, refers to
all the variables that directly influence the probability of $S_k$
taking on a particular value (for equivalent representations, see
Lauritzen & Spiegelhalter, 1988). The fact that the joint
probability can be written in the above form implies that there are
no “cycles” in the network; i.e., there exists an ordering of the
variables in the network such that no variable directly influences
any preceding variables.

In this paper we consider sigmoid belief networks where the
variables $S$ are binary (0/1), the conditional probabilities have
the form

$$P(S_i \mid \mathrm{pa}[i]) = g\Big((2S_i - 1) \sum_j W_{ij} S_j\Big) \tag{2}$$

and the weights $W_{ij}$ are zero unless $S_j$ is a parent of $S_i$,
thus preserving the feed-forward directionality of the network. For
notational convenience we have assumed the existence of a bias
variable whose value is clamped to one. The activation function
$g(\cdot)$ is chosen to be the cumulative Gaussian distribution
function given by

$$g(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{1}{2} z^2}\, dz = \frac{1}{\sqrt{2\pi}} \int_{0}^{\infty} e^{-\frac{1}{2}(z - x)^2}\, dz \tag{3}$$

Although very similar to the standard logistic function, this
activation function derives a number of advantages from its integral
representation. In particular, we may reinterpret the integration as
a marginalization and thereby obtain alternative representations for
the network. We consider two such representations.
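As a quick numerical illustration of the integral representation in equation 3, the following minimal Python sketch evaluates $g(x)$ in closed form and again as the mass that a unit-variance Gaussian centered at $x$ places on the positive half-line. The function names, the trapezoid rule, and the truncation of the infinite upper limit are assumptions of this sketch. The scaled logistic in the last column uses the commonly quoted constant 1.702, which does not come from the paper and is printed only to show how close the two sigmoids are.

```python
import math

def g(x):
    """Cumulative standard Gaussian, the activation function of equation 3."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def g_as_marginal(x, upper=12.0, steps=20000):
    """Trapezoid estimate of the integral over z in [0, inf) of N(z; x, 1).

    The infinite limit is truncated at max(x, 0) + upper (an assumption of
    this sketch; the neglected tail mass is negligible).
    """
    hi = max(x, 0.0) + upper
    dz = hi / steps

    def density(z):
        return math.exp(-0.5 * (z - x) ** 2) / math.sqrt(2.0 * math.pi)

    total = 0.5 * (density(0.0) + density(hi))
    for k in range(1, steps):
        total += density(k * dz)
    return total * dz

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    # closed form, marginalization form, scaled logistic for comparison
    print(x, g(x), g_as_marginal(x), logistic(1.702 * x))
```

The first two columns should agree to numerical precision, which is the sense in which the integration can be read as a marginalization over an auxiliary continuous variable.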

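We derive an extended representation by making explicit the
nonlinearities in the activation function. More precisely,

$$P(S_i \mid \mathrm{pa}[i]) = g\Big((2S_i - 1) \sum_j W_{ij} S_j\Big) = \int_{0}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\left[Z_i - (2S_i - 1) \sum_j W_{ij} S_j\right]^2} dZ_i \;\stackrel{\mathrm{def}}{=}\; \int_{0}^{\infty} P(S_i, Z_i \mid \mathrm{pa}[i])\, dZ_i \tag{4}$$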
This suggests defining the extended network in terms of the new
conditional probabilities $P(S_i, Z_i \mid \mathrm{pa}[i])$. By
construction then the original binary network is obtained by
marginalizing over the extra variables $Z$. In this sense the
extended network is (marginally) equivalent to the binary network.

We distinguish a complementary representation from the extended one
by writing the probabilities entirely in terms of continuous
variables¹. Such a representation can be obtained from the extended
network by a simple transformation of variables. The new continuous
variables are defined by $\tilde{Z}_i = (2S_i - 1) Z_i$, or,
equivalently, by $Z_i = |\tilde{Z}_i|$ and $S_i = \theta(\tilde{Z}_i)$
where $\theta(\cdot)$ is the step function. Performing this
transformation yields

$$P(\tilde{Z}_i \mid \mathrm{pa}[i]) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\left[\tilde{Z}_i - \sum_j W_{ij}\, \theta(\tilde{Z}_j)\right]^2} \tag{5}$$


which defines a network of conditionally Gaussian variables. The
original network in this case can be recovered by conditional
marginalization over $\tilde{Z}$ where the conditioning variables
are $\theta(\tilde{Z})$.
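The relationship between the complementary and binary networks can be checked by simulation. The sketch below draws each $\tilde{Z}_i$ from a unit-variance Gaussian whose mean is the weighted sum of the thresholded parents (equation 5) and compares the resulting frequencies of $S_i = \theta(\tilde{Z}_i)$ with the probabilities of equation 2. The three-unit network, its weights, the separate `bias` vector standing in for the clamped bias variable, and all function names are hypothetical choices made for illustration; units are indexed from 0 in the code.

```python
import math
import random

def g(x):
    """Cumulative standard Gaussian (equation 3)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def step(z):
    """Step function theta: continuous input -> binary output."""
    return 1 if z > 0.0 else 0

def sample_complementary(weights, bias, rng):
    """Ancestral sample of all Z~_i; unit i may only have parents j < i.

    weights[i][j] is W_ij and bias[i] replaces the weight from the clamped
    bias variable. Returns the continuous sample and S_i = theta(Z~_i).
    """
    z_tilde, s = [], []
    for i in range(len(bias)):
        mean = bias[i] + sum(weights[i][j] * s[j] for j in range(i))
        z = rng.gauss(mean, 1.0)
        z_tilde.append(z)
        s.append(step(z))
    return z_tilde, s

# Hypothetical three-unit network, chosen only for illustration.
weights = [[0.0, 0.0, 0.0],
           [1.5, 0.0, 0.0],
           [-1.0, 2.0, 0.0]]
bias = [0.3, -0.5, 0.0]

rng = random.Random(0)
n = 50000
count_first = 0          # samples in which unit 0 is on
count_second_given = 0   # of those, samples in which unit 1 is also on
for _ in range(n):
    _, s = sample_complementary(weights, bias, rng)
    if s[0] == 1:
        count_first += 1
        count_second_given += s[1]

freq_first = count_first / n
freq_second = count_second_given / count_first
print(freq_first, g(bias[0]))                   # empirical vs. equation 2
print(freq_second, g(bias[1] + weights[1][0]))  # empirical vs. equation 2
```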

Figure 1 below summarizes the relationships between the different
representations. As will become clear later, working with the
alternative representations instead of the original binary
representation can lead to more flexible and efficient
(least-squares) parameter estimation.


Figure 1: The relationship between the alternative representations.


3  The  learning  problem 

We consider the problem of learning the parameters of the network
from instantiations of variables contained in a training set. Such
instantiations, however, need not be complete; there may be
variables that have no value assignments in the training set as well
as variables that are always instantiated. The tacit division
between hidden (H) and visible (V) variables therefore depends on
the particular training example considered and is not an intrinsic
property of the network.

To learn from these instantiations we adopt the principle of maximum
likelihood to estimate the weights in the network. In essence, this
is a density estimation problem where the weights are chosen so as
to match the probabilistic behavior of the network with the observed
activities in the training set. Central to this estimation is the
ability to compute likelihoods (or log-likelihoods) for any
(partial) configuration of variables appearing in the training set.
In other words, if we let $X^V$ be the configuration of visible or
instantiated variables² and $X^H$ denote the hidden or
uninstantiated variables, we need to compute marginal probabilities
of the form

$$\log P(X^V) = \log \sum_{X^H} P(X^V, X^H) \tag{6}$$

¹ While the binary variables are the outputs of each unit, the
continuous variables pertain to the inputs - hence the name
complementary.

² To postpone the issue of representation we use $X$ to denote $S$,
$\{S, Z\}$, or $\tilde{Z}$ depending on the particular
representation chosen.

If the training samples are independent, then these log marginals
can be added to give the overall log-likelihood of the training set

$$\log P(\text{training set}) = \sum_t \log P(X^{V_t}) \tag{7}$$

Unfortunately, computing each of these marginal probabilities
involves summing (integrating) over an exponential number of
different configurations assumed by the hidden variables in the
network. This renders the sum (integration) intractable in all but a
few special cases (e.g. trees and chains). It is possible, however,
to instead find a manageable lower bound on the log-likelihood and
optimize the weights in the network so as to maximize this bound.

To obtain such a lower bound we resort to Jensen’s inequality:

$$\log P(X^V) = \log \sum_{X^H} P(X^V, X^H) = \log \sum_{X^H} Q(X^H)\, \frac{P(X^V, X^H)}{Q(X^H)} \;\ge\; \sum_{X^H} Q(X^H) \log \frac{P(X^V, X^H)}{Q(X^H)} \tag{8}$$


Although this bound holds for all distributions $Q(X^H)$ over the
hidden variables, the accuracy of the bound is determined by how
closely $Q$ approximates the posterior distribution
$P(X^H \mid X^V)$ in terms of the Kullback-Leibler divergence; if
the approximation is perfect the divergence is zero and the
inequality is satisfied with equality. Suitable choices for $Q$ can
make the bound both accurate and easy to compute. The feasibility of
finding such $Q$, however, is highly dependent on the choice of the
representation for the network.
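The role of the Kullback-Leibler divergence in equation 8 can be seen on a toy example. The sketch below takes a made-up joint distribution over one fixed visible configuration and a three-state hidden variable (the numbers are purely illustrative), evaluates the bound for several choices of $Q$, and prints the gap to the exact log marginal next to the divergence between $Q$ and the posterior. All names are assumptions of this sketch.

```python
import math

# Hypothetical values of P(X^V = observed, X^H = k) for k = 0, 1, 2.
joint = [0.10, 0.05, 0.25]

def jensen_bound(q, joint):
    """Sum_k q_k log(joint_k / q_k): the right-hand side of equation 8."""
    return sum(qk * math.log(pk / qk) for qk, pk in zip(q, joint) if qk > 0.0)

def kl(q, p):
    """Kullback-Leibler divergence KL(q || p) for discrete distributions."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0.0)

log_marginal = math.log(sum(joint))                # exact log P(X^V)
posterior = [pk / sum(joint) for pk in joint]      # exact P(X^H | X^V)

for q in ([1 / 3, 1 / 3, 1 / 3], [0.7, 0.2, 0.1], posterior):
    bound = jensen_bound(q, joint)
    # The gap between the exact value and the bound equals KL(Q || posterior).
    print(bound, log_marginal - bound, kl(q, posterior))
```

For the last choice, $Q$ equal to the posterior, the gap vanishes and the bound is attained with equality.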


4  Likelihood  bounds  in  different  representations

To complete the derivation of the likelihood bound (equation 8) we
need to fix the representation for the network. Which representation
to select, however, affects the quality and accuracy of the bound.
In addition, the accompanying bound of the chosen representation
implies bounds in the other two representational domains as they all
code the same distributions over the observables. In this section we
illustrate these points by deriving bounds in the complementary and
extended representations and discuss the corresponding bounds in the
original binary domain.

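Now, to obtain a lower bound we need to specify the approximate
posterior $Q$. In the complementary representation the conditional
probabilities are Gaussians and therefore a reasonable approximation
(mean field) is found by choosing the posterior approximation from
the family of factorized Gaussians:

$$Q(\tilde{Z}) = \prod_i \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(\tilde{Z}_i - h_i)^2} \tag{9}$$

Substituting this into equation 8 we obtain the bound

$$\log P(S^*) \;\ge\; -\frac{1}{2} \sum_i \Big(h_i - \sum_j W_{ij}\, g(h_j)\Big)^2 - \frac{1}{2} \sum_{ij} W_{ij}^2\, g(h_j)\big(1 - g(h_j)\big) \tag{10}$$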

The means $h_i$ for the hidden variables are adjustable parameters
that can be tuned to make the bound as tight as possible. For the
instantiated variables we need to enforce the constraints
$g(h_i) = S_i^*$ to respect the instantiation. These can be
satisfied very accurately by setting $h_i = 4(2S_i^* - 1)$. A very
convenient property of this bound and the complementary
representation in general is the quadratic weight dependence - a
property very conducive to fast learning. Finally, we note that the
complementary representation transforms the binary estimation
problem into a continuous density estimation problem.
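To make the structure of the complementary bound concrete, here is a minimal Python sketch that evaluates a mean-squared-error term plus a variance term of the kind described for equation 10, clamps an instantiated unit with $h_i = 4(2S_i^* - 1)$, and tunes the mean of one hidden unit. The two-unit network, the separate `bias` argument (standing in for the bias variable clamped to one), and the grid search are assumptions of this sketch; the paper does not specify an optimizer.

```python
import math

def g(x):
    """Cumulative standard Gaussian (equation 3)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def clamp_mean(s_star):
    """Mean assigned to an instantiated unit: h_i = 4(2 S_i^* - 1)."""
    return 4.0 * (2 * s_star - 1)

def complementary_bound(weights, h, bias=None):
    """Evaluate the bound for means h; weights[i][j] is W_ij.

    W_ij is zero unless unit j is a parent of unit i. bias[i] is added to
    the propagated input of unit i (an assumption replacing the bias unit).
    """
    n = len(h)
    bias = bias if bias is not None else [0.0] * n
    act = [g(hj) for hj in h]          # mean outputs g(h_j) under Q
    total = 0.0
    for i in range(n):
        mean_in = bias[i] + sum(weights[i][j] * act[j] for j in range(n))
        var_in = sum(weights[i][j] ** 2 * act[j] * (1.0 - act[j])
                     for j in range(n))
        total -= 0.5 * (h[i] - mean_in) ** 2 + 0.5 * var_in
    return total

# Hypothetical network: hidden unit 0 feeds visible unit 1, observed S_1^* = 1.
weights = [[0.0, 0.0],
           [3.0, 0.0]]
bias = [0.0, 0.5]

# Tune the free mean h_0 on a coarse grid (a stand-in for a real optimizer).
candidates = [k / 100.0 for k in range(-300, 301)]
best_h0 = max(candidates,
              key=lambda h0: complementary_bound(weights, [h0, clamp_mean(1)], bias))
best_bound = complementary_bound(weights, [best_h0, clamp_mean(1)], bias)
print(best_h0, best_bound)
```

For fixed means the returned value is a quadratic function of the weights and biases, so the weight update for fixed $h$ reduces to a least-squares problem.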

We now turn to the interpretation of the above bound in the binary
domain. The same bound can be obtained by first fixing the inputs to
all the units to be the means $h_i$ and then computing the negative
total mean squared error between the fixed inputs and the
corresponding probabilistic inputs propagated from the parents. That
this procedure indeed gives a lower bound on the log-likelihood
would be more difficult to justify by working with the binary
representation alone.

In the extended representation the probability distribution for
$Z_i$ is a truncated Gaussian given $S_i$ and its parents. We
therefore propose the partially factorized posterior approximation:

$$Q(S, Z) = \prod_i Q(Z_i \mid S_i)\, Q(S_i) \tag{11}$$

where $Q(Z_i \mid S_i)$ is a truncated Gaussian:

Q(Z_i | S_i) = [expression illegible in the source]  (12)

As  in  the  complementary  domain  the  resulting  bound 
depends  quadratically  on  the  weights.  Instead  of  writing 
out  the  bound  here,  however,  it  is  more  informative  to 
see  its  derivation  in  the  binary  domain. 

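A factorized posterior approximation (mean field)
$Q(S) = \prod_i q_i^{S_i} (1 - q_i)^{1 - S_i}$ for the binary
network yields a bound

$$\log P(S^*) \;\ge\; \sum_i \Big\langle S_i \log g\Big(\sum_j W_{ij} S_j\Big) \Big\rangle + \sum_i \Big\langle (1 - S_i) \log\Big(1 - g\Big(\sum_j W_{ij} S_j\Big)\Big) \Big\rangle - \sum_i \big[q_i \log q_i + (1 - q_i) \log(1 - q_i)\big] \tag{13}$$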

i 

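where the averages $\langle \cdot \rangle$ are with respect to the
$Q$ distribution. These averages, however, do not conform to
analytical expressions. The tractable posterior approximation in the
extended domain avoids the problem by implicitly making the
following Legendre transformation:

$$\log g(x) = \Big[\tfrac{1}{2} x^2 + \log g(x)\Big] - \tfrac{1}{2} x^2 \;\ge\; \lambda x - G(\lambda) - \tfrac{1}{2} x^2 \tag{14}$$

which holds since $x^2/2 + \log g(x)$ is a convex function, with
$G(\lambda)$ its conjugate. Inserting this back into the relevant
parts of equation 13 and performing the averages gives

$$\log P(S^*) \;\ge\; \sum_i \big[q_i \lambda_i - (1 - q_i) \bar{\lambda}_i\big] \sum_j W_{ij} q_j - \sum_i \big[q_i G(\lambda_i) + (1 - q_i) G(\bar{\lambda}_i)\big] - \frac{1}{2} \sum_i \Big(\sum_j W_{ij} q_j\Big)^2 - \frac{1}{2} \sum_{ij} W_{ij}^2\, q_j (1 - q_j) - \sum_i \big[q_i \log q_i + (1 - q_i) \log(1 - q_i)\big] \tag{15}$$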

which is quadratic in the weights as expected. The mean activities
$q$ for the hidden variables and the parameters $\lambda$ can be
optimized to make the bound tight. For the instantiated variables we
set $q_i = S_i^*$.
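The Legendre-type bound on $\log g(x)$ used above can be checked numerically. The sketch below approximates the conjugate function by a maximum over a finite grid and compares $\lambda x - G(\lambda) - x^2/2$ with $\log g(x)$. The grid, its range $[-5, 5]$, the restriction to positive $\lambda$ whose maximizer falls inside that range, and the function names are all assumptions of this sketch rather than part of the paper.

```python
import math

def g(x):
    """Cumulative standard Gaussian (equation 3)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def f(x):
    """The convex function x^2/2 + log g(x)."""
    return 0.5 * x * x + math.log(g(x))

def f_prime(x):
    """Derivative of f; the bound is tight at lambda = f'(x)."""
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return x + pdf / g(x)

GRID = [k / 100.0 for k in range(-500, 501)]   # x values in [-5, 5]

def conjugate(lam):
    """Grid estimate of G(lambda) = max_x [lambda * x - f(x)]."""
    return max(lam * x - f(x) for x in GRID)

def legendre_lower_bound(x, lam):
    """lambda * x - G(lambda) - x^2/2, a lower bound on log g(x)."""
    return lam * x - conjugate(lam) - 0.5 * x * x

x = 1.0
for lam in (0.5, 1.0, f_prime(x), 2.0):
    # exact log g(x) next to the bound for this lambda
    print(lam, math.log(g(x)), legendre_lower_bound(x, lam))
```

The bound holds for every admissible $\lambda$ and touches $\log g(x)$ when $\lambda$ equals the slope of the convex function at $x$, which is why the $\lambda$ parameters are worth optimizing.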

5  Numerical  experiments 

To test these techniques in practice we applied the complementary
network to the problem of detecting motor failures from spectra
obtained during motor operation (see Petsche et al. 1995). We cast
the problem as a continuous density estimation problem. The training
set consisted of 800 out of 1283 FFT spectra each with 319
components measured from an electric motor in a good operating
condition but under varying loads. The test set included the
remaining 483 FFTs from the same motor in a good condition in
addition to three sets of 1340 FFTs each measured when a particular
fault was present. The goal was to use the likelihood of a test FFT
with respect to the estimated density to determine whether there was
a fault present in the motor.
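The decision rule described here, flagging a fault when the likelihood of a spectrum falls below a threshold, can be sketched in a few lines. The log-likelihood scores below are made up for illustration and are not data from the experiment; the function name and the convention that lower scores indicate faults are assumptions of this sketch.

```python
def error_rates(good_loglik, fault_loglik, threshold):
    """Flag a spectrum as faulty when its log-likelihood is below threshold.

    Returns (rate of good motors misclassified as faulty,
             rate of faults that are missed).
    """
    false_alarms = sum(1 for v in good_loglik if v < threshold)
    misses = sum(1 for v in fault_loglik if v >= threshold)
    return false_alarms / len(good_loglik), misses / len(fault_loglik)

# Made-up log-likelihood scores, purely illustrative.
good = [-10.2, -11.5, -9.8, -12.1, -10.9]
faulty = [-15.3, -13.0, -11.8, -18.4]

for threshold in (-14.0, -12.5, -11.0):
    print(threshold, error_rates(good, faulty, threshold))
```

Sweeping the threshold trades one error type against the other, which is what error curves plotted against the likelihood threshold display.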

We used a layered 6 → 20 → 319 generative model to estimate the
training set density. The resulting classification error rates on
the test set are shown in figure 2 as a function of the threshold
likelihood. The achieved error rates are comparable to those of
Petsche et al. (1995).

6  Conclusions 

Network models that admit probabilistic formulations derive a number
of advantages from probability theory. Moving away from explicit
representations of dependencies, however, can make these properties
harder to exploit in practice. We showed that an efficient
estimation procedure can be derived for sigmoid belief networks,
where standard methods are intractable in all but a few special
cases (e.g. trees and chains). The efficiency of our approach
derived from the combination of two ideas. First, we avoided the
intractability of computing likelihoods in these networks by
computing lower bounds instead. Second, we introduced new
representations for these networks and showed how the lower bounds
in the new representational domains transform the parameter
estimation problem into quadratic optimization.


Figure 2: The probability of error curves for missing a fault
(dashed lines) and misclassifying a good motor (solid line) as a
function of the likelihood threshold.

Acknowledgements: 

The authors wish to thank Peter Dayan for helpful comments on the
manuscript.

References 

P. Dayan, G. Hinton, R. Neal, and R. Zemel (1995). The Helmholtz
machine. Neural Computation 7: 889-904.

A. Dempster, N. Laird, and D. Rubin (1977). Maximum likelihood from
incomplete data via the EM algorithm. J. Roy. Statist. Soc. B 39:
1-38.

G. Hinton, P. Dayan, B. Frey, and R. Neal (1995). The wake-sleep
algorithm for unsupervised neural networks. Science 268: 1158-1161.

S. L. Lauritzen and D. J. Spiegelhalter (1988). Local computations
with probabilities on graphical structures and their application to
expert systems. J. Roy. Statist. Soc. B 50: 154-227.

R. Neal (1992). Connectionist learning of belief networks.
Artificial Intelligence 56: 71-113.

J. Pearl (1988). Probabilistic Reasoning in Intelligent Systems.
Morgan Kaufmann: San Mateo.

T. Petsche, A. Marcantonio, C. Darken, S. J. Hanson, G. M. Kuhn, and
I. Santoso (1995). A neural network autoassociator for induction
motor failure prediction. In Advances in Neural Information
Processing Systems 8. MIT Press.

L. K. Saul, T. Jaakkola, and M. I. Jordan (1995). Mean field theory
for sigmoid belief networks. M.I.T. Computational Cognitive Science
Technical Report 9501.




